[SPARK-49179][SQL] Fix v2 multi bucketed inner joins throw AssertionError by ulysses-you · Pull Request #47683 · apache/spark

ulysses-you · 2024-08-09T08:08:30Z

What changes were proposed in this pull request?

For a sort-merge join (SMJ) with an inner join type, the output partitioning simply wraps the left and right output partitionings in a PartitioningCollection, so it may not satisfy the required clustering of the target.
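
To make the failure mode concrete, here is a toy Scala model of the situation. It is not Spark's code: `KeyGrouped`, `PartitioningGroup`, `strictSpec`, and `lenientSpec` are hypothetical stand-ins, and the assertion in `strictSpec` is only an assumed illustration of how a check that expects every wrapped partitioning to agree could fail for a chain of inner joins.

```scala
// Toy model only: simplified stand-ins, NOT Spark's Partitioning classes.
object PartitioningSketch {
  sealed trait Partitioning
  // Data is grouped by the given key columns.
  final case class KeyGrouped(keys: Seq[String]) extends Partitioning
  // Several partitionings describing the same output, e.g. after an inner join.
  final case class PartitioningGroup(members: Seq[Partitioning]) extends Partitioning

  // A key-grouped partitioning is usable only if all of its keys
  // are among the required clustering keys.
  private def usable(k: KeyGrouped, required: Set[String]): Boolean =
    k.keys.forall(required.contains)

  // Hypothetical strict check: assumes the members of a group are
  // either all usable or all unusable, and asserts on that assumption.
  def strictSpec(p: Partitioning, required: Set[String]): Option[KeyGrouped] = p match {
    case k: KeyGrouped => if (usable(k, required)) Some(k) else None
    case PartitioningGroup(members) =>
      val specs = members.map(strictSpec(_, required))
      assert(specs.forall(_.isEmpty) || specs.forall(_.isDefined))
      specs.headOption.flatten
  }

  // Hypothetical lenient check: picks the first usable member, if any.
  def lenientSpec(p: Partitioning, required: Set[String]): Option[KeyGrouped] = p match {
    case k: KeyGrouped => if (usable(k, required)) Some(k) else None
    case PartitioningGroup(members) =>
      members.flatMap(lenientSpec(_, required)).headOption
  }

  // Output of `t1 JOIN t2 ON t1.id = t2.id`: both sides wrapped together.
  val firstJoinOutput: Partitioning =
    PartitioningGroup(Seq(KeyGrouped(Seq("t1.id")), KeyGrouped(Seq("t2.id"))))

  // The next join (`... JOIN t3 ON t1.id = t3.id`) asks its left child
  // for clustering on t1.id only.
  val requiredBySecondJoin: Set[String] = Set("t1.id")
}
```

In this model, `strictSpec(firstJoinOutput, requiredBySecondJoin)` throws an `AssertionError` because only the `t1.id` member is usable, while `lenientSpec` with the same arguments returns the `t1.id` partitioning.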

Why are the changes needed?

Fixes an exception thrown when a query contains multiple bucketed inner joins, for example:

SELECT * FROM testcat.ns.t1
JOIN testcat.ns.t2 ON t1.id = t2.id
JOIN testcat.ns.t3 ON t1.id = t3.id

Cause: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:264)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
at scala.collection.immutable.List.map(List.scala:247)
at scala.collection.immutable.List.map(List.scala:79)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)

Does this PR introduce any user-facing change?

Yes, it is a bug fix.

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

no

ulysses-you · 2024-08-09T08:09:33Z

cc @cloud-fan @yaooqinn thank you

cloud-fan · 2024-08-09T14:15:59Z

Can you get a review from the active contributors to SPJ?

ulysses-you · 2024-08-10T01:40:06Z

cc @huaxingao @szehon-ho do you have time to take a look? Thank you.

ulysses-you · 2024-08-12T10:31:17Z

also cc @sunchao @viirya thank you

viirya

Makes sense to me.

viirya · 2024-08-12T18:24:17Z

sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala

+      val df = sql(
+        """
+          |SELECT * FROM testcat.ns.t1
+          |JOIN testcat.ns.t2 ON t1.id = t2.id
+          |JOIN testcat.ns.t3 ON t1.id = t3.id
+          |""".stripMargin)
+      assert(collectShuffles(df.queryExecution.executedPlan).isEmpty)


Can we also check the result?

added checkAnswer

…rror

Closes #47683 from ulysses-you/SPARK-49179.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
(cherry picked from commit 8133294)
Signed-off-by: youxiduo <youxiduo@corp.netease.com>

ulysses-you · 2024-08-13T05:10:28Z

Thank you all, merged to master/branch-3.5/branch-3.4.

dongjoon-hyun · 2024-08-13T05:45:39Z

Hi, @ulysses-you. Unfortunately, this seems to break branch-3.5/branch-3.4.

ulysses-you · 2024-08-13T05:47:25Z

Thank you @dongjoon-hyun, I will send a PR for each branch later.

dongjoon-hyun · 2024-08-13T05:47:51Z

Thanks. Sure, take your time.

For now, branch-3.5/3.4 are recovered via reverting.

…tionError

backport #47683 to branch-3.4

Closes #47736 from ulysses-you/SPARK-49179-3.4.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>

…tionError

backport #47683 to branch-3.5

Closes #47735 from ulysses-you/SPARK-49179-3.5.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>

Fix v2 multi bucketed inner joins throw AssertionError

f1f05fb

github-actions bot added the SQL label Aug 9, 2024

yaooqinn approved these changes Aug 9, 2024

viirya approved these changes Aug 12, 2024

viirya reviewed Aug 12, 2024

address comments

c72bf21

ulysses-you closed this in 8133294 Aug 13, 2024

ulysses-you deleted the SPARK-49179 branch August 13, 2024 05:10

This was referenced Aug 13, 2024

[SPARK-49179][SQL][3.5] Fix v2 multi bucketed inner joins throw AssertionError #47735

Closed

[SPARK-49179][SQL][3.4] Fix v2 multi bucketed inner joins throw AssertionError #47736

Closed

[SPARK-49179][SQL] Fix v2 multi bucketed inner joins throw AssertionError #47683
ulysses-you wants to merge 2 commits into apache:master from ulysses-you:SPARK-49179

5 participants
